COPYRIGHT NOTICE
A portion of the disclosure of this patent document
contains material which is subject to copyright protection.
The copyright owner has no objection to the xeroxographic
reproduction by anyone of the patent document or the patent
disclosure in exactly the form it appears in the Patent and
Trademark Office patent file or records, but otherwise
reserves all copyright rights whatsoever.

MICROFICHE APPENDIX 
Microfiche Appendices A to E comprising five (5) 
sheets, totaling 272 frames are included herewith. 

GOVERNMENT RIGHTS NOTICE 
Portions of the material in this specification arose 
in the course of or under contract nos. 92ER81275 (SBIR) 
between Affymetrix, Inc. and the Department of Energy and/or 
H600813-1, -2 between Affymetrix, Inc. and the National 
Institutes of Health. 



BACKGROUND OF THE INVENTION
The present invention relates to the field of
computer systems. More specifically, the present invention
relates to computer systems for visualizing biological
sequences, as well as for evaluating and comparing biological
sequences.

Devices and computer systems for forming and using
arrays of materials on a substrate are known. For example,
PCT application WO92/10588, incorporated herein by reference
for all purposes, describes techniques for sequencing or
sequence checking nucleic acids and other materials. Arrays
for performing these operations may be formed in arrays
according to the methods of, for example, the pioneering
techniques disclosed in U.S. Patent No. 5,143,854 and U.S.
Patent Application No. 08/249,188, both incorporated herein by
reference for all purposes.

According to one aspect of the techniques described
therein, an array of nucleic acid probes is fabricated at
known locations on a chip or substrate. A fluorescently
labeled nucleic acid is then brought into contact with the
chip and a scanner generates an image file indicating the
locations where the labeled nucleic acids bound to the chip.
Based upon the identities of the probes at these locations, it
becomes possible to extract information such as the monomer
sequence of DNA or RNA. Such systems have been used to form,
for example, arrays of DNA that may be used to study and
detect mutations relevant to cystic fibrosis, the P53 gene
(relevant to certain cancers), HIV, and other genetic
characteristics.

Improved computer systems and methods are needed to
evaluate, analyze, and process the vast amount of information
now used and made available by these pioneering technologies.

SUMMARY OF THE INVENTION 
An improved computer-aided system for visualizing 
and determining the sequence of nucleic acids is disclosed. 
The computer system provides, among other things, improved 
methods of analyzing fluorescent image files of a chip
containing hybridized nucleic acid probes in order to call
bases in sample nucleic acid sequences.

According to one aspect of the invention, a computer
system is used to identify an unknown base in a sample nucleic
acid sequence by the steps of:

- inputting multiple probe intensities, each of the
probe intensities being associated with a probe;

- the computer system comparing the multiple probe
intensities where each of the probe intensities is
substantially proportional to a probe hybridizing
with at least one sequence; and

- calling the unknown base according to the comparison of the
multiple probe intensities.

According to one specific aspect of the invention, a
higher probe intensity is compared to a lower probe intensity
to call the unknown base. According to another specific
aspect of the invention, probe intensities of a sample
sequence are compared to probe intensities of a reference
sequence. According to yet another specific aspect of the
invention, probe intensities of a sample sequence are compared
to statistics about probe intensities of a reference sequence
from multiple experiments.

According to another aspect of the invention, a
method is disclosed of processing reference and sample nucleic
acid sequences to reduce the variations between the
experiments by the steps of:

- providing a plurality of nucleic acid probes;

- labeling the reference nucleic acid sequence with
a first marker;

- labeling the sample nucleic acid sequence with a
second marker; and

- hybridizing the labeled reference and sample nucleic acid
sequences at the same time.

According to yet another aspect of the invention, a
computer system is used for comparative analysis and
visualization of multiple sequences by the steps of:

- displaying at least one reference sequence in a
first area on a display device; and

- displaying at least one sample sequence in a
second area on said display device;

whereby a user is capable of visually comparing the multiple
sequences.

A further understanding of the nature and advantages
of the inventions herein may be realized by reference to the
remaining portions of the specification and the attached
drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

Fig. 1 illustrates an overall system for forming and
analyzing arrays of biological materials such as DNA or RNA;

Fig. 2A is an illustration of the software for the 
overall system; Fig. 2B illustrates the global layout of a 
chip formed in the overall system; and Fig. 2C illustrates 
conceptually the binding of probes on chips; 

Fig. 3 illustrates the high level flow of the 
intensity ratio method; 

Fig. 4A illustrates the high level flow of one 
implementation of the reference method and Fig. 4B shows an 
analysis table for use with the reference method; 

Fig. 5A illustrates the high level flow of another 
implementation of the reference method; Fig. 5B shows a data 
table for use with the reference method; Fig. 5C shows a graph 
of the normalized sample base intensities minus the normalized 
reference base intensities; and Fig. 5D shows other graphs of 
data in the data table; 

Fig. 6 illustrates the high level flow of the 
statistical method; 

Fig. 7 illustrates the pooling processing of a 
reference and sample nucleic acid sequence; 

Fig. 8 illustrates the main screen and the 
associated pull down menus for comparative analysis and 
visualization of multiple experiments; 

Fig. 9 illustrates an intensity graph window for a 
selected base; 

Fig. 10 illustrates multiple intensity graph windows 
for selected bases; 

Fig. 11 illustrates the intensity ratio method
correctly calling a mutation in solutions with varying
concentrations;

Fig. 12 illustrates the reference method correctly
calling a mutant base where the intensity ratio method
incorrectly called the mutant base; and

Fig. 13 illustrates the output of the ViewSeq™
program with four pretreatment samples and four posttreatment
samples.

I. General

The present invention provides methods of analyzing
hybridization intensity files for a chip containing hybridized
nucleic acid probes. In a representative embodiment, the
files represent fluorescence data from a biological array, but
the files may also represent other data such as radioactive
intensity data. For purposes of illustration, the present
invention is described as being part of a computer system that
designs a chip mask, synthesizes the probes on the chip,
labels the nucleic acids, and scans the hybridized nucleic
acid probes. Such a system is fully described in U.S. Patent
Application No. 08/249,188 which has been incorporated by
reference for all purposes. However, the present invention
may be used separately from the overall system for analyzing
data generated by such systems.

Fig. 1 illustrates a computerized system for forming
and analyzing arrays of biological materials such as RNA or
DNA. A computer 100 is used to design arrays of biological
polymers such as RNA or DNA. The computer 100 may be, for
example, an appropriately programmed Sun Workstation or
personal computer or workstation, such as an IBM PC
equivalent, including appropriate memory and a CPU. The
computer system 100 obtains inputs from a user regarding
characteristics of a gene of interest, and other inputs
regarding the desired features of the array. Optionally, the
computer system may obtain information regarding a specific
genetic sequence of interest from an external or internal
database 102 such as GenBank. The output of the computer
system 100 is a set of chip design computer files 104 in the
form of, for example, a switch matrix, as described in PCT
application WO 92/10092, and other associated computer files.

The chip design files are provided to a system 106
that designs the lithographic masks used in the fabrication of
arrays of molecules such as DNA. The system or process 106
may include the hardware necessary to manufacture masks 110
and also the necessary computer hardware and software 108
necessary to lay the mask patterns out on the mask in an
efficient manner. As with the other features in Fig. 1, such
equipment may or may not be located at the same physical site,
but is shown together for ease of illustration in Fig. 1. The
system 106 generates masks 110 or other synthesis patterns
such as chrome-on-glass masks for use in the fabrication of
polymer arrays.

The masks 110, as well as selected information
relating to the design of the chips from system 100, are used
in a synthesis system 112. Synthesis system 112 includes the
necessary hardware and software used to fabricate arrays of
polymers on a substrate or chip 114. For example, synthesizer
112 includes a light source 116 and a chemical flow cell 118
on which the substrate or chip 114 is placed. Mask 110 is
placed between the light source and the substrate/chip, and
the two are translated relative to each other at appropriate
times for deprotection of selected regions of the chip.
Selected chemical reagents are directed through flow cell 118
for coupling to deprotected regions, as well as for washing
and other operations. All operations are preferably directed
by an appropriately programmed computer 119, which may or may
not be the same computer as the computer(s) used in mask
design and mask making.

The substrates fabricated by synthesis system 112
are optionally diced into smaller chips and exposed to marked
receptors. The receptors may or may not be complementary to
one or more of the molecules on the substrate. The receptors
are marked with a label such as a fluorescein label (indicated
by an asterisk in Fig. 1) and placed in scanning system 120.
Scanning system 120 again operates under the direction of an
appropriately programmed digital computer 122, which also may
or may not be the same computer as the computers used in
synthesis, mask making, and mask design. The scanner 120
includes a detection device 124 such as a confocal microscope
or CCD (charge-coupled device) that is used to detect the
location where labeled receptor (*) has bound to the
substrate. The output of scanner 120 is an image file(s) 124
indicating, in the case of fluorescein labeled receptor, the
fluorescence intensity (photon counts or other related
measurements, such as voltage) as a function of position on
the substrate. Since higher photon counts will be observed
where the labeled receptor has bound more strongly to the
array of polymers, and since the monomer sequence of the
polymers on the substrate is known as a function of position,
it becomes possible to determine the sequence(s) of polymer(s)
on the substrate that are complementary to the receptor.

The image file 124 is provided as input to an
analysis system 126 that incorporates the visualization and
analysis methods of the present invention. Again, the
analysis system may be any one of a wide variety of computer
system(s), but in a preferred embodiment the analysis system
is based on a Sun Workstation or equivalent. The present
invention provides various methods of analyzing the chip
design files and the image files, providing appropriate output
128. The present invention may further be used to identify
specific mutations in a receptor such as DNA or RNA.

Fig. 2A provides a simplified illustration of the
overall software system used in the operation of one
embodiment of the invention. As shown in Fig. 2A, the system
first identifies the genetic sequence(s) or targets that would
be of interest in a particular analysis at step 202. The
sequences of interest may, for example, be normal or mutant
portions of a gene, genes that identify heredity, or provide
forensic information. Sequence selection may be provided via
manual input of text files or may be from external sources
such as GenBank. At step 204 the system evaluates the gene to
determine or assist the user in determining which probes would
be desirable on the chip, and provides an appropriate "layout"
on the chip for the probes. A wild-type probe is a probe that
will ideally hybridize with the gene of interest and thus a
wild-type gene (also called the chip wild-type) would ideally
hybridize with all the wild-type probes on the chip. The
layout implements desired characteristics such as arrangement
on the chip that permits "reading" of genetic sequence and/or
minimization of edge effects, ease of synthesis, and the like.
Fig. 2B illustrates the global layout of a chip.
Chip 114 is composed of multiple units where each unit may
contain different tilings for the chip wild-type sequence.
Unit 1 is shown in greater detail and shows that each unit is
composed of multiple cells which are areas on the chip that
may contain probes. Conceptually, each unit is composed of
multiple sets of related cells. As used herein, the term cell
refers to a region on a substrate that contains many copies of
a molecule or molecules of interest. Each unit is composed of
multiple cells that may be placed in rows and columns. In one
embodiment, a set of five related cells includes the
following: a wild-type cell 220, "mutation" cells 222, and a
"blank" cell 224. Cell 220 contains a wild-type probe that is
the complement of a portion of the wild-type sequence. Cells
222 contain "mutation" probes for the wild-type sequence. For
example, if the wild-type probe is 3'-ACGT, the probes
3'-ACAT, 3'-ACCT, 3'-ACGT, and 3'-ACTT may be the "mutation"
probes. Cell 224 is the "blank" cell because it contains no
probes (also called the "blank" probe). As the blank cell
contains no probes, labeled receptors should not bind to the
chip in this area. Thus, the blank cell provides an area that
can be used to measure the background intensity.

In one embodiment, numerous tiling processes are
available including sequence tiling, block tiling, and
opt-tiling. Of course, a wide range of layout strategies may be
used according to the invention herein, without departing from
the scope of the invention. For example, the probes may be
tiled on a substrate in an apparently random fashion where a
computer system is utilized to keep track of the probe
locations and correlate the data obtained from the substrate.

Opt-tiling is the process of tiling additional
probes for suspected mutations. As a simple example of
opt-tiling, suppose the wild-type target sequence is
5'-ACGTATGCA-3' and it is suspected that a mutant sequence has
a possible T base mutation at the underlined base position.
Suppose further that the chip will be synthesized with a "4x3"
tiling strategy, meaning that probes of four monomers are used
and that the monomers in position 3, counting left to right, of
the probe are varied.

In opt-tiling, extra probes are tiled for each 
suspected mutation. The extra probes are tiled as if the 
mutation base is a wild-type base. The following shows the 
probes that may be generated for this example: 



Table 1

Probe Sequences (From 3'-end)
4x3 Opt-Tiling

Wild      TGCA    GCAT    CATA    ATAC    TACG
A sub.    TGAA    GCAT    CAAA    ATAC    TAAG
C sub.    TGCA    GCCT    CACA    ATCC    TACG
G sub.    TGGA    GCGT    CAGA    ATGC    TAGG
T sub.    TGTA    GCTT    CATA    ATTC    TATG

Wild      TGCA    GCAA    CAAA    AAAC    AACG
A sub.    TGAA    GCAA    CAAA    AAAC    AAAG
C sub.    TGCA    GCCA    CACA    AACC    AACG
G sub.    TGGA    GCGA    CAGA    AAGC    AAGG
T sub.    TGTA    GCTA    CATA    AATC    AATG

In the first "chip" above, the top row of the probes (along
with one probe below each of the four wild-type probes) should
bind to the target DNA sequence. However, if the target
sequence has a T base mutation as suspected, the labeled
mutant sequence will not bind that strongly to the probes in
the columns around column 3. For example, the mutant receptor
that could bind with the probes in column 2 is 5'-CGTT which
may not bind that strongly to any of the probes in column 2
because there are T bases at the ends of the receptor and
probes (i.e., not complementary). This often results in a
relatively dark scanned area around a mutation.

Opt-tiling, shown as the second "chip" above,
treats the suspected mutation as the wild-type base. Thus,
the mutant receptor 5'-CGTT should bind strongly to the
wild-type probe of column 2 (along with one probe below) and the
mutation can be further detected.
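
By way of illustration only, the probe rows of Table 1 might be generated as sketched below in Python. The function names `tile_probes` and `opt_tile` are hypothetical and do not appear in this specification. The sketch assumes that each probe, written from its 3'-end, is the base-by-base complement of a window of the target read 5' to 3', and that the suspected mutation lies at position 5 of the target, which is inferred from the second "chip" of Table 1. The sketch returns every possible window, whereas Table 1 lists five columns.

```python
# Watson-Crick complements for DNA bases.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def tile_probes(target, probe_len=4, sub_pos=3):
    """Return the wild-type and substitution probe rows for a tiling.

    target    -- target sequence written 5' to 3'
    probe_len -- number of monomers per probe (4 in a "4x3" tiling)
    sub_pos   -- 1-based probe position that is varied (3 in "4x3")
    Probes are written from the 3'-end, as in Table 1.
    """
    comp = "".join(COMPLEMENT[b] for b in target)
    wild = [comp[i:i + probe_len] for i in range(len(comp) - probe_len + 1)]
    rows = {"Wild": wild}
    for base in "ACGT":
        # Replace the monomer at the varied position with each base in turn.
        rows[base + " sub."] = [
            p[:sub_pos - 1] + base + p[sub_pos:] for p in wild
        ]
    return rows


def opt_tile(target, position, mutant_base, probe_len=4, sub_pos=3):
    """Tile extra probes as if the suspected mutation were wild-type.

    position is the 1-based position of the suspected mutation in target.
    """
    mutant = target[:position - 1] + mutant_base + target[position:]
    return tile_probes(mutant, probe_len, sub_pos)


first_chip = tile_probes("ACGTATGCA")
second_chip = opt_tile("ACGTATGCA", position=5, mutant_base="T")
print(first_chip["Wild"][:5])   # ['TGCA', 'GCAT', 'CATA', 'ATAC', 'TACG']
print(second_chip["Wild"][:5])  # ['TGCA', 'GCAA', 'CAAA', 'AAAC', 'AACG']
```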

Again referring to Fig. 2A, at step 206 the masks
for the synthesis are designed. At step 208 the software
utilizes the mask design and layout information to make the
DNA or other polymer chips. This software 208 will control,
among other things, relative translation of a substrate and
the mask, the flow of desired reagents through a flow cell,
the synthesis temperature of the flow cell, and other
parameters. At step 210, another piece of software is used in
scanning a chip thus synthesized and exposed to a labeled
receptor. The software controls the scanning of the chip, and
stores the data thus obtained in a file that may later be
utilized to extract sequence information.

At step 212 a computer system according to the 
present invention utilizes the layout information and the 
fluorescence information to evaluate the hybridized nucleic 
acid probes on the chip. Among the important pieces of 
information obtained from DNA chips are the identification of 
mutant receptors and determination of genetic sequence of a 
particular receptor. 

Fig. 2C illustrates the binding of a particular
target DNA to an array of DNA probes 114. As shown in this
simple example, the following probes are formed in the array
(only one probe is shown for the wild-type probe):

3'-AGAACGT
   AGACCGT
   AGAGCGT
   AGATCGT

As shown, the set of probes differ by only one base so the
probes are designed to determine the identity of the base at
that position. When a fluorescein-labeled (or otherwise
marked) target
with the sequence 5'-TCTTGCA is exposed to the array, it is
complementary only to the probe 3'-AGAACGT, and fluorescein
will be primarily found on the surface of the chip where
3'-AGAACGT is located. Thus, for each set of probes that differ
by only one base, the image file will contain four
fluorescence intensities, one for each probe. Each
fluorescence intensity can therefore be associated with the
base of each probe that is different from the other probes.
Additionally, the image file will contain a "blank" cell which
can be used as the fluorescence intensity of the background.
By analyzing the five fluorescence intensities associated with
a specific base location, it becomes possible to extract
sequence information from such arrays using the methods of the
invention disclosed herein.

The present invention calls bases by assigning the 
bases the following codes: 




Code    Group               Meaning
A       A                   Adenine
C       C                   Cytosine
G       G                   Guanine
T       T(U)                Thymine (Uracil)
M       A or C              aMino
R       A or G              puRine
W       A or T(U)           Weak interaction (2 H bonds)
Y       C or T(U)           pYrimidine
S       C or G              Strong interaction (3 H bonds)
K       G or T(U)           Keto
V       A, C or G           not T(U)
H       A, C or T(U)        not G
D       A, G or T(U)        not C
B       C, G or T(U)        not A
N       A, C, G, or T(U)    Insufficient intensity to call
X       A, C, G, or T(U)    Insufficient discrimination to call

Most of the codes conform to the IUPAC standard. However,
code N has been redefined and code X has been added.
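
For illustration, the code table above might be held as a simple lookup, as in the following Python sketch. The names `CODE_GROUPS`, `NO_CALL_CODES`, `is_no_call`, and `is_consistent` are hypothetical and are not part of this specification. The sketch writes T for T(U) and treats N and X as "no call" codes, which is an assumption about how downstream software might use them.

```python
# Code -> group of bases, following the code table (T stands for T(U)).
CODE_GROUPS = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "AC", "R": "AG", "W": "AT", "Y": "CT", "S": "CG", "K": "GT",
    "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT",
    "N": "ACGT",  # redefined: insufficient intensity to call
    "X": "ACGT",  # added: insufficient discrimination to call
}

# Codes that report a failure to call rather than a set of candidates.
NO_CALL_CODES = {"N", "X"}


def is_no_call(code):
    """Return True when the code reports that no call could be made."""
    return code in NO_CALL_CODES


def is_consistent(code, base):
    """Return True when base is among the bases allowed by code."""
    return base in CODE_GROUPS[code]


print(is_consistent("K", "T"))  # True: K is G or T(U)
print(is_consistent("K", "A"))  # False
print(is_no_call("X"))          # True
```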

II. Intensity Ratio Method
The intensity ratio method is a method of calling
bases in a sample nucleic acid sequence. The intensity ratio
method is most accurate when there is good discrimination
between the fluorescence intensities of hybrid matches and
hybrid mismatches. If there is insufficient discrimination,
the intensity ratio method assigns a corresponding ambiguity
code to the unknown base.

For simplicity, the intensity ratio method will be
described as being used to identify one unknown base in a
sample nucleic acid sequence. In practice, the method is used
to identify many or all the bases in a nucleic acid sequence.

The unknown base will be identified by evaluation of
up to four mutation probes and a "blank" cell, which is a
location where a labeled receptor should not bind to the chip
since no probe is present. For example, suppose a DNA
sequence of interest or target sequence contains the sequence
5'-AGAACCTGC-3' with a possible mutation at the underlined
base position. Suppose that 5-mer probes are to be
synthesized for the target sequence. A representative
wild-type probe of 3'-TTGGA is complementary to the region of the
sequence around the possible mutation. The "mutation" probes
will be the same as the wild-type probe except for a different
base at the third position as follows: 3'-TTAGA, 3'-TTCGA,
3'-TTGGA, and 3'-TTTGA.

If the fluorescently marked sample sequence is
exposed to the above four mutation probes, the intensity
should be highest for the probe that binds most strongly to
the sample sequence. Therefore, if the probe 3'-TTTGA shows
the highest intensity, the unknown base in the sample will
generally be called an A mutation because the probes are
complementary to the sample sequence.

The mutation probes are identical to the wild-type
probes except that they each contain one of the four A, C, G,
or T "mutations" for the unknown base. Although one of the
"mutation" probes will optimally be identical to the wild-type
probe, such redundant probes are intentionally synthesized for
quality control and design consistency.

The identity of the unknown base is preferably
determined by evaluating the relative fluorescence intensities
of up to four of the mutation probes, and the "blank" cell.
Because each mutation probe is identifiable by the mutation
base, a mutation probe's intensity will be referred to as the
"base intensity" of the mutation base.

As a simple example of the intensity ratio method,
suppose a gene of interest (target) is an HIV protease gene
with the sequence 5'-ATGTGGACAGTTGTA-3'. Suppose further that
a sample sequence is suspected to have the same sequence as
the target sequence except for a mutation of base C to base T
at the underlined base position. Although hundreds of probes
may be synthesized on the chip, the complementary mutation
probes synthesized to detect a mutation in the sample sequence
at the suspected mutation position may be as follows:

3'-TATC
3'-TCTC
3'-TGTC (wild-type)
3'-TTTC

The mutation probe 3'-TGTC is also the wild-type probe as it
should bind most strongly with the target sequence.

After the sample sequence is labeled, hybridized on
the chip, and scanned, suppose the following fluorescence
intensities were obtained:

3'-TATC -> 45
3'-TCTC -> 8
3'-TGTC -> 32
3'-TTTC -> 12

where the intensity is measured by the photon count detected
by the scanner. The "blank" cell had a fluorescence intensity
of 2. The photon counts in the examples herein are
representative (not actual data) and provided for illustration
purposes. In practice, the actual photon counts will vary
greatly depending on the experiment parameters and the scanner
utilized.

Although each fluorescence intensity is from a
probe, the probes may be characterized by their unique
mutation base so the bases may be said to have the following
intensities:

A -> 45
C -> 8
G -> 32
T -> 12

Thus, base A will be described as having an intensity of 45,
which corresponds to the intensity of the mutation probe with
the mutation base A.
Initially, each mutation base intensity is reduced
by the background or "blank" cell intensity. This is done as
follows:

A -> 45 - 2 = 43
C -> 8 - 2 = 6
G -> 32 - 2 = 30
T -> 12 - 2 = 10

Then, the base intensities are sorted by intensity. The above
bases would be sorted as follows:

A -> 43
G -> 30
T -> 10
C -> 6

Next, the highest intensity base is compared to the second
highest intensity base. Thus, the ratio of the intensity of
base A to the intensity of base G is calculated as follows:
A:G = 43 / 30 = 1.4. The ratio A:G is then compared to a
predetermined ratio cutoff which is a number that specifies
the ratio required to identify the unknown base. For example,
if the ratio cutoff is 1.2, the ratio A:G is greater than the
ratio cutoff (1.4 > 1.2) and the unknown base is called by the
mutation probe containing the mutation A. As probes are
complementary to the sample sequence, the sample sequence is
called as having a mutation T, resulting in a called sample
sequence of 5'-ATGTGGATAGTTGTA-3'.

As another example, suppose everything else is the
same as in the previous example except that the sorted
background adjusted intensities were as follows:

C -> 42
A -> 40
G -> 10
T -> 8

The ratio of the highest intensity base to the second highest
intensity base (C:A) is 1.05. Because this ratio is not
greater than the ratio cutoff of 1.2, the unknown base will be
called as being ambiguously one of two or more bases as
follows.

The second highest intensity base is then compared
to the third highest base. The ratio of A:G is 4. The ratio
of A:G is then compared to the ratio cutoff of 1.2. As the
ratio A:G is greater than the ratio cutoff (4 > 1.2), the
unknown base is called by the mutation probes containing the
mutations C or A. As probes are complementary to the sample
sequence, the sample sequence is called as having either a
mutation G or T, resulting in a sample sequence of
5'-ATGTGGAKAGTTGTA-3' where K is the IUPAC code for G or T(U).

The ratio cutoff in the previous examples was equal 
to 1.2. However, the ratio cutoff will generally need to be 
adjusted to produce optimal results for the specific chip 
design and wild-type target. Also, although the ratio cutoff 
used has been the same for each ratio comparison, the ratio 
cutoff may vary depending on whether the ratio comparisons 
involve the highest, second highest, third highest, etc. 
intensity base. 

Fig. 3 illustrates the high level flow of the
intensity ratio method. At step 302 the four base intensities
are adjusted by subtracting the background or "blank" cell
intensity from each base intensity. Preferably, if a base
intensity is then less than or equal to zero, the base
intensity is set equal to a small positive number to prevent
division by zero or negative numbers in future calculations.

At step 304 the base intensities are sorted by
intensity. Each base is then associated with a number from 1
to 4. The base with the highest intensity is 1, second
highest 2, third highest 3, and fourth highest 4. Thus, the
intensity of base 1 ≥ base 2 ≥ base 3 ≥ base 4.

At step 306 the highest intensity base (base 1) is
checked to see if it has sufficient intensity to call the
unknown base. The intensity is checked by determining if the
intensity of base 1 is greater than a predetermined background
difference cutoff. The background difference cutoff is a
number that specifies the intensity a base intensity must be
over the background intensity in order to correctly call the
unknown base. Thus, the background adjusted base intensity
must be greater than the background difference cutoff or the
unknown is not callable.

If the intensity of base 1 is not greater than the
background difference cutoff, the unknown base is assigned the
code N (insufficient intensity) as shown at step 308.
Otherwise, the ratio of the intensity of base 1 to base 2 is
calculated as shown at step 310.

At step 312 the ratio 1:2 is
compared to the ratio cutoff. If the ratio 1:2 is greater
than the ratio cutoff, the unknown base is called as the
complement of the highest intensity base (base 1) as shown at
step 314. Otherwise, the ratio of the intensity of base 2 to
base 3 is calculated as shown at step 316.

At step 318 the ratio 2:3 is
compared to the ratio cutoff. If the ratio 2:3 is greater
than the ratio cutoff, the unknown base is called as being an
ambiguity code specifying the complements of the highest or
second highest intensity bases (base 1 or 2) as shown at step
320. Otherwise, the ratio of the intensity of base 3 to base
4 is calculated as shown at step 322.

At step 324 the ratio 3:4 is
compared to the ratio cutoff. If the ratio 3:4 is greater
than the ratio cutoff, the unknown base is called as being an
ambiguity code specifying the complements of the highest,
second highest, or third highest bases (base 1, 2 or 3) as
shown at step 326. Otherwise, the unknown base is assigned
the code X (insufficient discrimination) as shown at step 328.
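
By way of illustration only, the flow of Fig. 3 might be sketched in Python as follows. The function name `call_base`, the default background difference cutoff of 5.0, and the small positive floor of 1e-6 are assumptions chosen for the sketch and are not values taken from this specification. The ratio cutoff of 1.2 is the value used in the examples above, and a single cutoff is applied to every comparison for simplicity.

```python
# Watson-Crick complements: probes are complementary to the sample.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

# Set of candidate bases -> code, per the code table (T stands for T(U)).
CODES = {
    frozenset(group): code
    for code, group in {
        "A": "A", "C": "C", "G": "G", "T": "T",
        "M": "AC", "R": "AG", "W": "AT", "Y": "CT", "S": "CG", "K": "GT",
        "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT",
    }.items()
}


def call_base(intensities, background, ratio_cutoff=1.2,
              background_cutoff=5.0, floor=1e-6):
    """Call one unknown base from four mutation-base intensities.

    intensities maps each mutation base (A, C, G, T) to its raw intensity;
    background is the "blank" cell intensity.
    """
    # Step 302: subtract background; replace non-positive values with a
    # small positive number so later ratios are well defined.
    adjusted = {}
    for base, value in intensities.items():
        diff = value - background
        adjusted[base] = diff if diff > 0 else floor

    # Step 304: sort bases from highest to lowest adjusted intensity.
    ranked = sorted(adjusted, key=adjusted.get, reverse=True)

    # Steps 306-308: insufficient intensity to call.
    if adjusted[ranked[0]] <= background_cutoff:
        return "N"

    # Steps 310-326: compare ratios 1:2, 2:3, then 3:4 to the cutoff.
    for n in (1, 2, 3):
        ratio = adjusted[ranked[n - 1]] / adjusted[ranked[n]]
        if ratio > ratio_cutoff:
            called = frozenset(COMPLEMENT[b] for b in ranked[:n])
            return CODES[called]

    # Step 328: insufficient discrimination.
    return "X"


# First example above: raw intensities with a blank cell intensity of 2.
print(call_base({"A": 45, "C": 8, "G": 32, "T": 12}, background=2))   # T
# Second example: adjusted intensities C 42, A 40, G 10, T 8.
print(call_base({"C": 44, "A": 42, "G": 12, "T": 10}, background=2))  # K
```

Because the ratio cutoff may differ between comparisons, a fuller version could accept a separate cutoff for each of the three ratios.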

The advantage of the intensity ratio method is that
it is very accurate when there is good discrimination between
the fluorescence intensities of hybrid matches and hybrid
mismatches. However, if the base corresponding to a correct
hybrid gives a lower intensity than a mismatch (e.g., as a
result of cross-hybridization), incorrect identification of
the base will result. For this same reason, the method is
useful for comparative assessment of hybridization quality and
as an indicator of sequence-specific problem spots. For
example, the intensity ratio method has been used to determine
that ambiguities and miscalls tend to be very different from
sequence to sequence, and reflect predominantly the
composition and repetitiveness of the sequence. It has also
been used to assess improvements obtained by varying
hybridization conditions, sample preparation, and
post-hybridization treatments (e.g., RNase treatment).

III. Reference Method

The reference method is a method of calling bases in
a sample nucleic acid sequence. The reference method depends
very little on discrimination between the fluorescence
intensities of hybrid matches and hybrid mismatches, and
therefore is much less sensitive to cross-hybridization. The
method compares the probe intensities of a reference sequence
to the probe intensities of a sample sequence. Any
significant changes are flagged as possible mutations. There
are two implementations of the reference method disclosed
herein.

For simplicity, the reference method will be
described as being used to identify one unknown base in a
sample nucleic acid sequence. In practice, the method is used
to identify many or all the bases in a nucleic acid sequence.

The unknown base will be called by comparing the
probe intensities of a reference sequence to the probe
intensities of a sample sequence. Preferably, the probe
intensities of the reference sequence and the sample sequence
are from chips having the same chip wild-type. However, the
reference sequence may or may not be exactly the same as the
chip wild-type as it may have mutations.

The bases at the same position in the reference and
sample sequences will each be associated with up to four
mutation probes and a "blank" cell. The unknown base in the
sample sequence is called by comparing probe intensities of
the sample sequence to probe intensities of the reference
sequence. For example, suppose the chip wild-type contains
the sequence 5'-AGACCTTGC-3' and it is suspected that the
sample has a possible mutation at the underlined base
position, which is the unknown base that will be called by the
reference method. The "mutation" probes for the sample
sequence may be as follows: 3'-GAAA, 3'-GCAA, 3'-GGAA, and
3'-GTAA, where 3'-GGAA is the wild-type probe.

Suppose further that a reference sequence, which
differs from the chip wild-type by one base mutation, has the
sequence 5'-AGACATTGC-3' where the mutation base is
underlined. The "mutation" probes for the reference sequence
may be as follows: 3'-TGAAA, 3'-TGCAA, 3'-TGGAA, and
3'-TGTAA, where 3'-TGTAA is the reference wild-type probe since
the reference sequence is known. Although generally the
sample and reference sequences were tiled with the same chip
wild-type, this is not required and the tiling methods do not
have to be identical as shown in the example. Thus, the
unknown base will be called by comparing the "mutation" probes
of the sample sequence to the "mutation" probes of the
reference sequence. As before, because each mutation probe is
identifiable by the mutation base, the mutation probes'
intensities will be referred to as the "base intensities" of
their respective mutation bases.

As a simple example of one implementation of the
reference method, suppose a gene of interest (target) has the
sequence 5'-AAAACTGAAAA-3'. Suppose a reference sequence has
the sequence 5'-AAAACCGAAAA-3', which differs from the target
sequence by the underlined base. The reference sequence is
marked and exposed to probes on a chip with the target
sequence being the chip wild-type. Suppose further that a
sample sequence is suspected to have the same sequence as the
target sequence except for a mutation at the underlined base
position in 5'-AAAACTGAAAA-3'. The sample sequence is also
marked and exposed to probes on a chip with the target
sequence being the chip wild-type. After hybridization and
scanning, the following probe intensities (not actual data)
were found for the respective complementary probes:

     Sample
     3'-GACT -> 11
     3'-GCCT -> 30
     3'-GGCT -> 60
     3'-GTCT -> 6

     Reference
     3'-TGAC -> 12
     3'-TGCC -> 9
     3'-TGGC -> 80
     3'-TGTC -> 15

Although each fluorescence intensity is from a probe, the
probes may be identified by their unique mutation base so the
bases may be said to have the following intensities:

     Reference        Sample
     A -> 12          A -> 11
     C -> 9           C -> 30
     G -> 80          G -> 60
     T -> 15          T -> 6

Thus, base A of the reference sequence will be described as
having an intensity of 12, which corresponds to the intensity
of the mutation probe with the mutation base A. The reference
method will now be described as calling the unknown base in
the sample sequence by using these intensities.

Fig. 4A illustrates the high level flow of one
implementation of the reference method. For illustration
purposes, the reference method is described as filling in the
columns (identified by the numbers along the bottom) of the
analysis table shown in Fig. 4B. However, the generation of
an analysis table is not necessary to practice the method.
The analysis table is shown to aid the reader in understanding
the method.

At step 402 the four base intensities of the
reference and sample sequences are adjusted by subtracting the
background or "blank" cell intensity from each base intensity.
Each set of "mutation" probes has an associated "blank" cell.
Suppose that the reference "blank" cell intensity is 1 and the
sample "blank" cell intensity is 2. The base intensities are
then background subtracted as follows:

     Reference               Sample
     A -> 12 - 1 = 11        A -> 11 - 2 = 9
     C -> 9 - 1 = 8          C -> 30 - 2 = 28
     G -> 80 - 1 = 79        G -> 60 - 2 = 58
     T -> 15 - 1 = 14        T -> 6 - 2 = 4

Preferably, if a base intensity is then less than or equal to
zero, the base intensity is set equal to a small positive
number to prevent division by zero or negative numbers in
future calculations.

For identification, the position of the base of
interest in the reference and sample sequences is placed in
column 1 of the analysis table. Also, since the reference
sequence is a known sequence, the base at this position is
known and is referred to as the reference wild-type. The
reference wild-type is placed in column 2 of the analysis
table, which is C for this example.

At step 404 the base intensity associated with the
reference wild-type (column 2 of the analysis table) is
checked to see if it has sufficient intensity to call the
unknown base. In this example, the reference wild-type is C.
However, the base intensity associated with the wild-type is
the G base intensity, which is 79 in this example. This is
because the base intensities actually represent the
complementary "mutation" probes. The G base intensity is
checked by determining if its intensity is greater than a
predetermined background difference cutoff. The background
difference cutoff is a number that specifies how far the
base intensities must be above the background intensity in
order to correctly call the unknown base. Thus, the base
intensity associated with the reference wild-type must be
greater than the background difference cutoff or the unknown
base is not callable.

If the background difference cutoff is 5, the base
intensity associated with the reference wild-type has
sufficient intensity (79 > 5) so a P (pass) is placed in
column 3 of the analysis table as shown at step 406.
Otherwise, at step 407 an F (fail) is placed in column 3 of
the analysis table.
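The background subtraction of step 402 and the intensity check of
steps 404 to 407 can be sketched as follows. This is a minimal
sketch under stated assumptions: the function names and the value
chosen for the "small positive number" are hypothetical, and only
the example intensities and the cutoff of 5 come from the text.

```python
SMALL_POSITIVE = 0.01  # hypothetical value for the "small positive number"

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def background_subtract(intensities, blank, floor=SMALL_POSITIVE):
    """Subtract the blank cell intensity; clamp non-positive results."""
    adjusted = {}
    for base, value in intensities.items():
        difference = value - blank
        adjusted[base] = difference if difference > 0 else floor
    return adjusted


def intensity_flag(intensity, background_difference_cutoff):
    """Return 'P' (pass) or 'F' (fail) for the analysis table."""
    return "P" if intensity > background_difference_cutoff else "F"


# Example intensities from the text (reference blank = 1, sample blank = 2).
reference = background_subtract({"A": 12, "C": 9, "G": 80, "T": 15}, blank=1)
sample = background_subtract({"A": 11, "C": 30, "G": 60, "T": 6}, blank=2)

# The reference wild-type is C, so the associated base intensity is G.
reference_wild_type = "C"
column_3 = intensity_flag(reference[COMPLEMENT[reference_wild_type]], 5)
```

Here `column_3` evaluates to P because 79 is greater than 5.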

At step 408 the ratios of the base intensity
associated with the reference wild-type to each of the
possible bases are calculated. The ratio of the base
intensity associated with the reference wild-type to itself
will be 1 and the other ratios will usually be greater than 1.
The base intensity associated with the reference wild-type is
G so the following ratios are calculated:

     G:A -> 79 / 11 = 7.2
     G:C -> 79 / 8 = 9.9
     G:G -> 79 / 79 = 1.0
     G:T -> 79 / 14 = 5.6

These ratios are placed in columns 4 through 7 of the analysis
table, respectively.

At step 410 the highest base intensity associated
with the sample sequence is checked to see if it has
sufficient intensity to call the unknown base. The highest
base intensity is checked by determining if the intensity is
greater than the background difference cutoff. Thus, the
highest base intensity must be greater than the background
difference cutoff or the unknown base is not callable.

Again, if the background difference cutoff is 5, the
highest base intensity, which is G in this example, has
sufficient intensity (58 > 5) so a P (pass) is placed in
column 8 of the analysis table as shown at step 412.
Otherwise, at step 413 an F (fail) is placed in column 8 of
the analysis table.

At step 414 the ratios of the highest base intensity
of the sample to each of the possible bases are calculated.
The ratio of the highest base intensity to itself will be 1
and the other ratios will usually be greater than 1. Thus,
the highest base intensity is G so the following ratios are
calculated:

     G:A -> 58 / 9 = 6.4
     G:C -> 58 / 28 = 2.3
     G:G -> 58 / 58 = 1.0
     G:T -> 58 / 4 = 14.5

These ratios are placed in columns 9 through 12 of the
analysis table, respectively.
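The ratio calculations of steps 408 and 414 can be sketched as
below, using the background subtracted example intensities. The
function name and dictionary layout are assumptions made for
illustration. Note that evaluating 58 / 28 directly gives about
2.07, whereas the table above lists 2.3; the sketch simply reports
the computed value.

```python
def ratios_to(numerator_base, intensities):
    """Ratio of one base intensity to each of the four base intensities."""
    top = intensities[numerator_base]
    return {base: top / value for base, value in intensities.items()}


# Background subtracted example intensities from the text.
reference = {"A": 11, "C": 8, "G": 79, "T": 14}
sample = {"A": 9, "C": 28, "G": 58, "T": 4}

# Step 408: the intensity associated with the reference wild-type C is G.
reference_ratios = ratios_to("G", reference)

# Step 414: the highest sample base intensity is used as the numerator.
highest_sample_base = max(sample, key=sample.get)
sample_ratios = ratios_to(highest_sample_base, sample)
```

The ratio of the numerator base to itself is always 1, which
corresponds to columns 6 and 11 of the analysis table in this
example.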

At step 416 if both the reference and sample
sequence probes failed to have sufficient intensity to call
the unknown base, meaning there is an 'F' in columns 3 and 8
of the analysis table, the unknown base is assigned the code N
(insufficient intensity) as shown at step 418. An 'N' is
placed in column 17 of the analysis table. Additionally, a
confidence code of 9 is placed in column 18 of the analysis
table where the confidence codes have the following meanings:

     Code   Meaning
     0      Probable reference wild-type
     1      Probable mutation
     2      Reference sufficient intensity,
            insufficient intensity in sample
            suggests possible mutation
     3      Borderline differences, unknown base
            ambiguous
     4      Sample sufficient intensity, insufficient
            intensity in reference to allow
            comparison
     5-8    Currently unassigned
     9      Insufficient intensity in reference and
            sample, no interpretation possible

The confidence codes are useful for indicating to the user the
resulting analysis of the reference method.

At step 420 if only the reference sequence probes
failed to have sufficient intensity to call the unknown base,
meaning there is an 'F' in column 3 and a 'P' in column 8 of
the analysis table, the unknown base is assigned the code N
(insufficient intensity) as shown at step 422. An 'N' is
placed in column 17 and a confidence code of 4 is placed in
column 18 of the analysis table.

At step 424 if only the sample sequence probes
failed to have sufficient intensity to call the unknown base,
meaning there is a 'P' in column 3 and an 'F' in column 8 of
the analysis table, the unknown base is assigned the code N
(insufficient intensity) as shown at step 426. An 'N' is
placed in column 17 and a confidence code of 2 is placed in
column 18 of the analysis table.
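The three insufficient-intensity outcomes of steps 416 to 426 reduce
to a small decision table. The following sketch (the function name
and return convention are assumptions) maps the pass/fail flags of
columns 3 and 8 to the base code of column 17 and the confidence
code of column 18:

```python
def insufficient_intensity_call(reference_flag, sample_flag):
    """Return (base code, confidence code), or None if both flags pass.

    reference_flag and sample_flag are 'P' or 'F' (columns 3 and 8).
    """
    if reference_flag == "F" and sample_flag == "F":
        return ("N", 9)  # steps 416/418: no interpretation possible
    if reference_flag == "F":
        return ("N", 4)  # steps 420/422: reference too weak to compare
    if sample_flag == "F":
        return ("N", 2)  # steps 424/426: weak sample, possible mutation
    return None  # both passed; continue with the ratios of ratios
```

In the running example both flags are P, so the function returns
`None` and the method proceeds to step 428.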

In this example, both the reference and sample
sequence probes have sufficient intensity to call the unknown
base. At step 428 the ratios of the reference ratios to the
sample ratios for each base type are calculated. Thus, the
ratio A:A (column 4 to column 9) is placed in column 13 of the
analysis table. The ratio C:C (column 5 to column 10) is
placed in column 14 of the analysis table. The ratio G:G
(column 6 to column 11) is placed in column 15 of the analysis
table. Lastly, the ratio T:T (column 7 to column 12) is
placed in column 16 of the analysis table. These ratios are
calculated as follows:

     A:A -> 7.2 / 6.4 = 1.1
     C:C -> 9.9 / 2.3 = 4.3
     G:G -> 1.0 / 1.0 = 1.0
     T:T -> 5.6 / 14.5 = 0.4

The unknown base is called by comparing these ratios of ratios
to two predetermined values as follows.

At step 430 if all the ratios of ratios (columns 13
to 16 of the analysis table) are less than a predetermined
lower ratio cutoff, the unknown base is assigned the code of
the reference wild-type as shown at step 432. Thus, the code
for the reference wild-type (as shown in column 2) would be
placed in column 17 and a confidence code of 0 placed in
column 18 of the analysis table.

At step 434 if all the ratios of ratios are less
than a predetermined upper ratio cutoff, the unknown base is
assigned an ambiguity code that indicates the unknown base may
be any one of the bases that has a complementary ratio of
ratios greater than the lower ratio cutoff and less than the
upper ratio cutoff as shown at step 436. Thus, if the ratios
of ratios for A:A, C:C and G:G are all greater than the lower
ratio cutoff and less than the upper ratio cutoff, the unknown
base would be assigned the code B (meaning "not A"). This is
because the ratios of ratios are complementary to their
respective base as follows:

     A:A -> T
     C:C -> G
     G:G -> C

so the unknown base is called as being either C, G, or T,
which is identified by the IUPAC code B. This ambiguity code
would be placed in column 17 and a confidence code of 3 would
be placed in column 18 of the analysis table.

At step 438 at least one of the ratios of ratios is
greater than the upper ratio cutoff and the unknown base is
called as the base complementary to the highest ratio of
ratios. The code for the base complementary to the highest
ratio of ratios would be placed in column 17 and a confidence
code of 1 would be placed in column 18 of the analysis table.

Assume for the purposes of this example that the
lower ratio cutoff is 1.5 and the upper ratio cutoff is 3.
Again, the ratios of ratios are as follows:

     A:A -> 1.1
     C:C -> 4.3
     G:G -> 1.0
     T:T -> 0.4

As not all the ratios of ratios are less than the upper ratio
cutoff, the unknown base is called the base complementary to
the highest ratio of ratios. The highest ratio of ratios is
C:C, which has a complementary base G. Thus, the unknown base
is called G which is placed in column 17 and a confidence code
of 1 is placed in column 18 of the analysis table.
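Steps 428 through 438 can be sketched as a single function. This is
an illustrative sketch only: the function name, argument layout, and
the treatment of a ratio exactly equal to a cutoff are assumptions,
and the IUPAC table is the standard nucleotide code set. The usage
below feeds in the rounded ratios listed in the text.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

# Standard IUPAC nucleotide codes, keyed by the sorted set of bases.
# "ACGT" maps to the IUPAC "any base" code, which is distinct from the
# insufficient-intensity code N used elsewhere in the method.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "AC": "M", "AG": "R", "AT": "W", "CG": "S", "CT": "Y", "GT": "K",
    "ACG": "V", "ACT": "H", "AGT": "D", "CGT": "B", "ACGT": "N",
}


def call_by_reference(reference_ratios, sample_ratios, reference_wild_type,
                      lower_cutoff, upper_cutoff):
    """Return (base code, confidence code) for one position."""
    # Step 428: ratio of each reference ratio to the sample ratio.
    ratio_of_ratios = {base: reference_ratios[base] / sample_ratios[base]
                       for base in "ACGT"}
    values = ratio_of_ratios.values()
    # Steps 430/432: nothing changed appreciably, call the wild-type.
    if all(r < lower_cutoff for r in values):
        return reference_wild_type, 0
    # Steps 434/436: borderline changes give an ambiguity code.
    # Assumption: a ratio equal to the lower cutoff counts as borderline.
    if all(r < upper_cutoff for r in values):
        candidates = sorted(COMPLEMENT[base]
                            for base, r in ratio_of_ratios.items()
                            if r >= lower_cutoff)
        return IUPAC["".join(candidates)], 3
    # Step 438: call the complement of the highest ratio of ratios.
    best = max(ratio_of_ratios, key=ratio_of_ratios.get)
    return COMPLEMENT[best], 1


example_call = call_by_reference(
    {"A": 7.2, "C": 9.9, "G": 1.0, "T": 5.6},   # columns 4 to 7
    {"A": 6.4, "C": 2.3, "G": 1.0, "T": 14.5},  # columns 9 to 12
    reference_wild_type="C", lower_cutoff=1.5, upper_cutoff=3.0)
```

With these inputs `example_call` is `("G", 1)`, matching the call
and confidence code described for the worked example.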

The example shows how the unknown base in the sample
nucleic acid sequence was correctly called as base G.
Although the complementary "mutation" probe associated with
the base G (3'-GCCT) did not have the highest fluorescence
intensity, the unknown base was called as base G because the
associated "mutation" probe had the highest ratio increase
over the other "mutation" probes.

Fig. 5A illustrates the high level flow of another
implementation of the reference method. As in the previous
implementation, this implementation also compares the probe
intensities of a reference sequence to the probe intensities
of a sample sequence. However, this implementation differs
conceptually from the previous implementation in that
neighboring probe intensities are also analyzed, resulting in
more accurate base calling.

As a simple example of this implementation of the
reference method, suppose a reference sequence has a sequence
of 5'-AAACCCAATCCACATCA-3' and a sample sequence has a
sequence of 5'-AAACCCAGTCCACATCA-3', where the mutant base is
underlined. Thus, there is a mutation of A to G. Suppose
further that the reference and sample sequences are tiled on
chips with the reference sequence being the chip wild-type.
This implementation of the reference method will be described
as identifying this mutation base.

For illustration purposes, this implementation of
the reference method is described as filling in a data table
shown in Fig. 5B. Although the data table contains more data
than is required for this implementation, the portions of the
data table that are produced by steps in Fig. 5A are shown
with the same reference numerals. The generation of a data
table is not necessary, however, and is shown to aid the
reader in understanding the method. The mutant base position
is at position 241 in the reference and sample sequences,
which is shown in bold in the data table.

At step 502 the base intensities of the reference
and sample sequences are adjusted by subtracting the
background or "blank" cell intensity from each base intensity.
Preferably, if a base intensity is then less than or equal to
zero, the base intensity is set equal to a small positive
number to prevent division by zero or negative numbers. In the
data table, data 502A is the background subtracted base
intensities for the reference sequence and data 502B is the
background subtracted base intensities for the sample sequence
(also called the "mutant" sequence in the data table).

At step 504 the base intensity associated with the
reference wild-type is checked to see if it has sufficient
intensity to call the unknown base. In this example, the
reference wild-type is base A at position 241. The base
intensity associated with the reference wild-type is
identified by a lower case "a" in the left hand column. Thus,
the base intensities in the data table are not identified by
their complements and the reference wild-type at the mutation
position has an intensity of 385. The reference wild-type
intensity of 385 is checked by determining if its intensity is
greater than a predetermined background difference cutoff.
The background difference cutoff is a number that specifies
how far the base intensities must be above the background
intensity in order to correctly call the unknown base. Thus,
the base intensity associated with the reference wild-type
must be greater than the background difference cutoff or the
unknown base is not callable.

If the base intensity associated with the reference
wild-type is not greater than the background difference
cutoff, the wild-type sequence would fail to have sufficient
intensity as shown at step 506. Otherwise, at step 508 the
wild-type sequence would pass by having sufficient intensity.

At step 510 calculations are performed on the
background subtracted base intensities of the reference
sequence in order to "normalize" the intensities. Each
position in the reference sequence has four background
subtracted base intensities associated with it. The ratio of
the intensity of each base to the sum of the intensities of
the possible bases (all four) is calculated, resulting in four
ratios, one for each base as shown in the data table. Thus,
the following ratios would be calculated at each position in
the reference sequence:

     A ratio = A / (A+C+G+T)
     C ratio = C / (A+C+G+T)
     G ratio = G / (A+C+G+T)
     T ratio = T / (A+C+G+T)

At position 241, A ratio would be the wild-type ratio. These
ratios are generally calculated in order to "normalize" the
intensity data as the photon counts may vary widely from
experiment to experiment. Thus, the ratios provide a way of
reconciling the intensity variations between experiments.
Preferably, if the photon counts do not vary widely from
experiment to experiment, the probe intensities do not need to
be "normalized."
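The normalization of step 510 divides each background subtracted
base intensity by the sum of the four. A minimal sketch follows;
the function name is an assumption, and apart from the wild-type
intensity of 385 given in the text, the example intensities are
made up for illustration because the data table of Fig. 5B is not
reproduced here.

```python
def normalize(intensities):
    """Ratio of each base intensity to the sum of all four."""
    total = sum(intensities.values())
    return {base: value / total for base, value in intensities.items()}


# Position 241 of the reference: A = 385 is from the text; the C, G and
# T values are hypothetical.
reference_241 = {"A": 385, "C": 40, "G": 50, "T": 25}
reference_ratios_241 = normalize(reference_241)
```

Because the four ratios always sum to 1, two experiments with very
different overall photon counts can be compared on the same scale.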

At step 512 the highest base intensity associated
with the sample sequence is checked to see if it has
sufficient intensity to call the unknown base. The intensity
is checked by determining if the highest intensity sample base
is greater than the background difference cutoff. If the
intensity is not greater than the background difference
cutoff, the sample sequence fails to have sufficient intensity
as shown at step 514. Otherwise, at step 516 the sample
sequence passes by having sufficient intensity.

At step 518 calculations are performed on the
background subtracted base intensities of the sample sequence
in order to "normalize" the intensities. Each position in the
sample sequence has four background subtracted base
intensities associated with it. The ratio of the intensity of
each base to the sum of the intensities of the possible bases
(all four) is calculated, resulting in four ratios, one for
each base as shown in the data table.

At step 520 if either the reference or sample
sequences failed to have sufficient intensity, the unknown
base is assigned the code N (insufficient intensity) as shown
at step 522.

At step 524 the normalized base intensities of the
reference sequence are subtracted from the normalized base
intensities of the sample sequence. Thus, at each position
the following calculations are performed:

     A Difference = Sample A Ratio - Reference A Ratio
     C Difference = Sample C Ratio - Reference C Ratio
     G Difference = Sample G Ratio - Reference G Ratio
     T Difference = Sample T Ratio - Reference T Ratio

where the reference and sample ratios are calculated at steps
510 and 518, respectively. The base differences resulting
from these calculations are shown in the data table.

At step 526 each position is checked to see if there
is a base difference greater than an upper difference cutoff
and a base difference lower than a lower difference cutoff.
For example, Fig. 5C shows a graph of the normalized sample base
intensities minus the normalized reference base intensities.
Suppose that the upper difference cutoff is 0.15 and the lower
difference cutoff is -0.15 as shown by the horizontal lines in
Fig. 5C. At the mutation position (labeled with a reference
0), the G difference is 0.28 which is greater than 0.15, the
upper difference cutoff. Similarly, the A difference is -0.32
which is less than -0.15, the lower difference cutoff. As
there is a base difference above the upper difference cutoff
and a base difference below the lower difference cutoff, there
may be a mutation at this position.

If there is not a base difference above the upper
difference cutoff and a base difference below the lower
difference cutoff, the base at that position is assigned the
code of the reference wild-type base as shown at step 528.
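The difference calculation of step 524 and the cutoff check of step
526 can be sketched as follows. The function names are assumptions.
The G and A differences (0.28 and -0.32) and the cutoffs of 0.15 and
-0.15 are the values given in the text; the C and T differences are
hypothetical, chosen so that the four differences sum to zero as
they must when both sets of ratios sum to 1.

```python
def base_differences(sample_ratios, reference_ratios):
    """Normalized sample ratio minus normalized reference ratio per base."""
    return {base: sample_ratios[base] - reference_ratios[base]
            for base in "ACGT"}


def is_candidate_mutation(differences, upper_cutoff=0.15, lower_cutoff=-0.15):
    """True when one difference is above the upper cutoff and another
    difference is below the lower cutoff (the check of step 526)."""
    above = any(d > upper_cutoff for d in differences.values())
    below = any(d < lower_cutoff for d in differences.values())
    return above and below


# Differences at the mutation position; C and T are made-up values.
differences_241 = {"A": -0.32, "C": 0.03, "G": 0.28, "T": 0.01}
candidate = is_candidate_mutation(differences_241)
```

For these values `candidate` is true, which corresponds to the
statement that there may be a mutation at this position.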

At step 530 the ratio of the highest background
subtracted base intensity in the sample to the background
subtracted reference wild-type base intensity is calculated.
For example, at the mutation position 241 in the data table,
the highest background subtracted base intensity in the sample
is 571 (base G). The background subtracted reference
wild-type base intensity is 385 (base A). Thus, the ratio of
571:385 is calculated and results in 1.48 as shown in the data
table.

At step 532 these ratios are compared to a ratio at
a neighboring position. The ratio for the nth position is
subtracted from the ratio for the rth position, where r = n +
1. For example, at the mutation position 241 in the data
table, the ratio at position 242 (which equals 1.02) is
subtracted from the ratio at position 241 (which equals 1.48).
It has been found that a mutant can be confidently detected by
analyzing the difference of these neighboring ratios.
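The ratio of step 530 and the neighboring difference of step 532
can be sketched as below. The function names are assumptions. The
direction of the subtraction follows the worked example in the text
(the ratio at position 242 is subtracted from the ratio at position
241); this is an interpretive choice, and the characteristic pattern
that is then looked for in graph 532 is not modeled here.

```python
def sample_to_wild_type_ratio(highest_sample_intensity, reference_wild_type_intensity):
    """Step 530: highest sample intensity over reference wild-type intensity."""
    return highest_sample_intensity / reference_wild_type_intensity


def neighbor_differences(ratios):
    """Step 532: ratio at each position minus the ratio at the next position.

    ratios is a list ordered by position; the result has one fewer entry.
    """
    return [current - following
            for current, following in zip(ratios, ratios[1:])]


ratio_241 = sample_to_wild_type_ratio(571, 385)   # values from the text
difference_241 = neighbor_differences([1.48, 1.02])[0]
```

Here `ratio_241` rounds to 1.48 and `difference_241` is about 0.46.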

Fig. 5D shows other graphs of data in the data
table. Of particular importance is the graph identified as
532 because this is a graph of the calculations at step 532.
The pattern shown in a box in graph 532 has been found to be
characteristic of a mutation. Thus, if this pattern is
detected, the base is called as the base (or bases) with a
normalized difference greater than the upper difference cutoff
as shown at step 536. For example, the pattern was detected
and at step 526 it was shown that base G had a normalized
difference of 0.28, which is greater than the upper difference
cutoff of 0.15. Therefore, the base at position 241 in the
sample sequence is called a base G, which is a mutation from
the reference sequence (A to G).

If the pattern is not detected at step 534, the base
at that position is assigned the code of the reference
wild-type base as shown at step 538.

This second implementation of the reference method
is preferable in some instances as it takes into account the
probe intensities of neighboring probes. Thus, the first
implementation may not have detected the A to G mutation in
this example.

The advantage of the reference method is that the
correct base can be called even in the presence of significant
levels of cross-hybridization, as long as ratios of
intensities are fairly consistent from experiment to
experiment. In practice, the number of miscalls and
ambiguities is significantly reduced, while the number of
correct calls is actually increased, making the reference
method very useful for identifying candidate mutations. The
reference method has also been used to compare the
reproducibility of experiments in terms of base calling.

IV. Statistical Method 

The statistical method is a method of calling bases
in a sample nucleic acid sequence. The statistical method
utilizes the statistical variation across experiments to call
the bases. Therefore, the statistical method is good at
calling bases if data from multiple experiments is available
and the data is fairly consistent among the experiments. The
method compares the probe intensities of a sample sequence to
statistics of probe intensities of a reference sequence in
multiple experiments.

For simplicity, the statistical method will be
described as being used to identify one unknown base in a
sample nucleic acid sequence. In practice, the method is used
to identify many or all the bases in a nucleic acid sequence.

The unknown base will be called by comparing the
probe intensities of a sample sequence to statistics on probe
intensities of a reference sequence in multiple experiments.
Generally, the probe intensities of the sample sequence and
the reference sequence experiments are from chips having the
same chip wild-type. However, the reference sequence may or
may not be equal to the chip wild-type, as it may have
mutations.

A base at the same position in the reference and
sample sequences will be associated with up to four mutation
probes and a "blank" cell. As before, because each mutation
probe is identifiable by the mutation base, the mutation
probes' intensities will be referred to as the "base
intensities" of their respective mutation bases.

As a simple example of the statistical method,
suppose a gene of interest (target) has the sequence
5'-AAAACTGAAAA-3'. Suppose a reference sequence has the sequence
5'-AAAACCGAAAA-3', which differs from the target sequence by
the underlined base. Suppose further that a sample sequence
is suspected to have the same sequence as the target sequence
except for a T base mutation at the underlined base position
in 5'-AAAACTGAAAA-3'. Suppose that in multiple experiments
the reference sequence is marked and exposed to probes on a
chip. Suppose further the sample sequence is also marked and
exposed to probes on a chip.

The following are complementary "mutation" probes
that could be used for a reference experiment and the sample
sequence:

     Reference        Sample
     3'-TGAC          3'-GACT
     3'-TGCC          3'-GCCT
     3'-TGGC          3'-GGCT
     3'-TGTC          3'-GTCT

The "mutation" probes shown for the reference sequence may be
from only one experiment; the other experiments may have
different "mutation" probes, chip wild-types, tiling methods,
and the like. Although each fluorescence intensity is from a
probe, since the probes may be identified by their unique
mutation bases, the probe intensities may be identified by
their respective bases as follows:

     Reference             Sample
     3'-TGAC -> A          3'-GACT -> A
     3'-TGCC -> C          3'-GCCT -> C
     3'-TGGC -> G          3'-GGCT -> G
     3'-TGTC -> T          3'-GTCT -> T

Thus, base A of the reference sequence will be described as
having an intensity which corresponds to the intensity of the
mutation probe with the mutation base A. The statistical
method will now be described as calling the unknown base in
the sample sequence by using this example.

Fig. 6 illustrates the high level flow of the 
statistical method. At step 602 the four base intensities 
associated with the sample sequence and each of the multiple 
reference experiments are adjusted by subtracting the 
background or "blank" cell intensity from each base intensity. 
Preferably, if a base intensity is then less than or equal to 
zero, the base intensity is set equal to a small positive 
number to prevent division by zero or negative numbers. 

At step 604 the intensities of the reference
wild-type bases in the multiple experiments are checked to see if
they all have sufficient intensity to call the unknown base.
The intensities are checked by determining if the intensity of
the reference wild-type base of an experiment is greater than
a predetermined background difference cutoff. The wild-type
probe shown earlier for the reference sequence is 3'-TGGC, and
thus the G base intensity is the wild-type base intensity.
These steps are analogous to steps in the other two methods
described herein.

If the intensity of any one of the reference
wild-type bases is not greater than the background difference
cutoff, the wild-type experiments fail to have sufficient
intensity as shown at step 606. Otherwise, at step 608 the
wild-type experiments pass by having sufficient intensity.

At step 610 calculations are performed on the
background subtracted base intensities of each of the
reference experiments in order to "normalize" the intensities.
Each reference experiment has four background subtracted base
intensities associated with it: one wild-type and three for
the other possible bases. In this example, the G base
intensity is the wild-type, the A, C, and T base intensities
being the "other" intensities. The ratios of the intensity of
each base to the sum of the intensities of the possible bases
(all four) are calculated, giving one wild-type ratio and
three "other" ratios. Thus, the following ratios would be
calculated:

     A ratio = A / (A+C+G+T)
     C ratio = C / (A+C+G+T)
     G ratio = G / (A+C+G+T)
     T ratio = T / (A+C+G+T)

where G ratio is the wild-type ratio and A, C, and T ratios
are the "other" ratios. These four ratios are calculated for
each reference experiment. Thus if the number of reference
experiments is n, there would be 4n ratios calculated. These
ratios are generally calculated in order to "normalize" the
intensity data, as the photon counts may vary widely from
experiment to experiment. However, if the probe intensities
do not vary widely from experiment to experiment, the probe
intensities do not need to be "normalized."

At step 612 statistics are prepared for the ratios
calculated for each of the reference experiments. As stated
before, each reference experiment will be associated with one
wild-type ratio and three "other" ratios. The mean and
standard deviation are calculated for all the wild-type
ratios. The mean and standard deviation are also calculated
for each of the other ratios, resulting in three other means
and standard deviations, one for each of the bases that is not
the wild-type base. Therefore, the following would be
calculated:

     Mean and standard deviation of A ratios
     Mean and standard deviation of C ratios
     Mean and standard deviation of G ratios
     Mean and standard deviation of T ratios

where the mean and standard deviation of the G ratios are also
known as the wild-type mean and the wild-type standard
deviation, respectively. The means and standard deviations of
the A, C, and T ratios are also known collectively as the
"other" means and standard deviations.

Suppose that the preceding calculations produced the 

20 following data: 

A ratios -> mean = 0.16   std. dev. = 0.003
C ratios -> mean = 0.03   std. dev. = 0.002
G ratios -> mean = 0.71   std. dev. = 0.050
T ratios -> mean = 0.11   std. dev. = 0.004
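
The statistics of step 612 might be computed as in the following Python sketch. The reference ratios are hypothetical values, and the use of the sample standard deviation (`statistics.stdev`) is an assumption, since the text does not say whether the sample or the population form is intended.

```python
from statistics import mean, stdev


def ratio_statistics(reference_ratios):
    """Return {base: (mean, standard deviation)} across reference experiments."""
    stats = {}
    for base in "ACGT":
        values = [ratios[base] for ratios in reference_ratios]
        # Assumption: sample standard deviation; pstdev would give the
        # population form instead.
        stats[base] = (mean(values), stdev(values))
    return stats


# Hypothetical ratios from three reference experiments (wild-type base G).
reference_ratios = [
    {"A": 0.16, "C": 0.03, "G": 0.70, "T": 0.11},
    {"A": 0.17, "C": 0.03, "G": 0.70, "T": 0.10},
    {"A": 0.15, "C": 0.03, "G": 0.73, "T": 0.09},
]

stats = ratio_statistics(reference_ratios)
wild_type_mean, wild_type_std = stats["G"]
```

At least two reference experiments are needed for a sample standard deviation to be defined.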

In one embodiment, the steps up to and including
step 612 are performed in a preprocessing stage for the
multiple wild-type experiments. The results of the
preprocessing stage are stored in a file so that the reference
calculations do not have to be repeatedly calculated, which
results in increased performance. Microfiche Appendices C and
D contain the programming code to perform the preprocessing
stage.

At step 614 the highest base intensity associated
with the sample sequence is checked to see if it has
sufficient intensity to call the unknown base. The intensity
is checked by determining if the highest intensity unknown
base is greater than the background difference cutoff. If the
intensity is not greater than the background difference
cutoff, the sample sequence fails to have sufficient intensity
as shown at step 616. Otherwise, at step 618 the sample
sequence passes by having sufficient intensity.
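
The check of steps 614 to 618 reduces to a single comparison, sketched below in Python. The function name and the cutoff value of 20 are hypothetical; the text does not give a numeric background difference cutoff.

```python
def has_sufficient_intensity(intensities, background_difference_cutoff):
    """Pass only if the highest base intensity is strictly above the cutoff."""
    return max(intensities.values()) > background_difference_cutoff


# Hypothetical background subtracted intensities and cutoff.
sample = {"A": 310, "C": 50, "G": 26, "T": 100}
sample_passes = has_sufficient_intensity(sample, background_difference_cutoff=20)
```

The comparison is strict because the text fails the sequence when the intensity "is not greater than" the cutoff.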

At step 620 calculations are performed on the four
background subtracted intensities of the sample sequence. The
ratios of the background subtracted intensity of each base to
the sum of the background subtracted intensities of the
possible bases (all four) are calculated, giving four ratios,
one for each base. For consistency, the ratio associated with
the reference wild-type base is called the wild-type ratio,
with there being three "other" ratios. Thus, the following
ratios would be calculated:

A ratio = A / (A+C+G+T)
C ratio = C / (A+C+G+T)
G ratio = G / (A+C+G+T)
T ratio = T / (A+C+G+T)

where ratio G is the wild-type ratio and ratios A, C, and T
are the "other" ratios.

Suppose the background subtracted intensities
associated with the sample are as follows:

A -> 310
C -> 50
G -> 26
T -> 100

Then, the corresponding ratios would be as follows:

A ratio = 310 / (310 + 50 + 26 + 100) = 0.64
C ratio = 50 / (310 + 50 + 26 + 100) = 0.10
G ratio = 26 / (310 + 50 + 26 + 100) = 0.05
T ratio = 100 / (310 + 50 + 26 + 100) = 0.21
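
The arithmetic of this example can be reproduced with a short Python check. The intensities are those given in the text; rounding to two decimal places is assumed to match the way the ratios are printed.

```python
# Background subtracted sample intensities from the example in the text.
sample = {"A": 310, "C": 50, "G": 26, "T": 100}

total = sum(sample.values())  # 310 + 50 + 26 + 100 = 486

# Ratios rounded to two decimals, as the text reports them.
sample_ratios = {base: round(value / total, 2) for base, value in sample.items()}
```

The rounded values are 0.64, 0.10, 0.05, and 0.21 for A, C, G, and T respectively, in agreement with the ratios listed above.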

At step 622 if either the reference experiments or
the sample sequence failed to have sufficient intensity, the
unknown base is assigned the code N (insufficient intensity)
as shown at step 624.

At step 626 the wild-type and "other" ratios
associated with the sample sequence are compared to
statistical expressions. The statistical expressions include
four predetermined standard deviation cutoffs, one associated
with each base. Thus, there is a standard deviation cutoff
for each of the bases A, C, G, and T. The standard deviation
cutoffs allow the unknown base to be called with higher
precision because each standard deviation cutoff can be set to
a different value. Suppose the standard deviation cutoffs are
set as follows:

A standard deviation cutoff -> 4
C standard deviation cutoff -> 2
G standard deviation cutoff -> 8
T standard deviation cutoff -> 4

The wild-type base ratio associated with the sample is
compared to a corresponding statistical expression:

WT ratio ≥ WT mean - (WT std. dev. * WT base std.
dev. cutoff)

where the WT base std. dev. cutoff is the standard deviation
cutoff for the wild-type base. As the wild-type base is G,
the above comparison solves to the following:

0.05 ≥ 0.71 - (0.050 * 8)
0.05 ≥ 0.31

which is not a true expression (0.05 is less than 0.31).

Each of the "other" ratios associated with the
sample is compared to a corresponding statistical expression:

Other ratio > Other mean + (Other std. dev. * Other
base std. dev. cutoff)

where the Other base std. dev. cutoff is the standard
deviation cutoff for the particular "other" base. Thus, the
above comparison solves to the following three expressions:

A: 0.64 > 0.16 + (0.003 * 4), or 0.64 > 0.172
C: 0.10 > 0.03 + (0.002 * 2), or 0.10 > 0.034
T: 0.21 > 0.11 + (0.004 * 4), or 0.21 > 0.126

which are all true expressions.

At step 628 if only the wild-type ratio of the
sample sequence was greater than the statistical expression,
the unknown base is assigned the code of the reference
wild-type base as shown at step 630.

At step 632 if one or more of the "other" ratios of
the sample sequence were greater than their respective
statistical expressions, the unknown base is assigned an
ambiguity code that indicates the unknown base may be any one
of the complements of these bases, including the reference
wild-type. In this example, the "other" ratios for A, C, and
T were all greater than their corresponding statistical
expressions. Thus, the unknown base would be called the
complements of these bases, represented by the subset T, G,
and A. Thus, the unknown base would be assigned the code D
(meaning "not C").

If none of the ratios are greater than their 
respective statistical expressions, the unknown base is 
assigned the code X (insufficient discrimination) as shown at 
step 636. 
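
Steps 622 through 636 can be summarized in the following Python sketch, which reproduces the worked example. It is not the appendix code. Two points are assumptions: the wild-type comparison is taken to be "greater than or equal to", and the wild-type base is added to an ambiguity call only when its own comparison passes (the text says "including the reference wild-type" without stating the condition). The function and table names are hypothetical.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

# IUPAC codes for one-, two-, and three-base subsets.
IUPAC_CODE = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("AC"): "M",
    frozenset("GT"): "K", frozenset("CG"): "S", frozenset("AT"): "W",
    frozenset("CGT"): "B", frozenset("AGT"): "D",
    frozenset("ACT"): "H", frozenset("ACG"): "V",
}


def statistical_call(sample_ratios, stats, cutoffs, wild_type, sufficient=True):
    """Call one base from sample ratios and reference (mean, std. dev.) pairs."""
    if not sufficient:
        return "N"  # insufficient intensity (step 624)
    wt_mean, wt_std = stats[wild_type]
    # Assumption: the wild-type comparison is ">=".
    wt_passes = sample_ratios[wild_type] >= wt_mean - wt_std * cutoffs[wild_type]
    passing_others = [
        base for base in "ACGT"
        if base != wild_type
        and sample_ratios[base] > stats[base][0] + stats[base][1] * cutoffs[base]
    ]
    if passing_others:
        called = {COMPLEMENT[base] for base in passing_others}
        if wt_passes:  # assumption: wild-type joins the call only if it passes
            called.add(wild_type)
        return IUPAC_CODE[frozenset(called)]
    return wild_type if wt_passes else "X"  # X: insufficient discrimination


# Values taken from the worked example in the text.
stats = {"A": (0.16, 0.003), "C": (0.03, 0.002),
         "G": (0.71, 0.050), "T": (0.11, 0.004)}
cutoffs = {"A": 4, "C": 2, "G": 8, "T": 4}
sample_ratios = {"A": 0.64, "C": 0.10, "G": 0.05, "T": 0.21}

call = statistical_call(sample_ratios, stats, cutoffs, wild_type="G")
```

With the example values the wild-type comparison fails and all three "other" comparisons pass, so the call is the code for the subset T, G, and A, namely D.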

The statistical method provides accurate base 
calling because it utilizes statistical data from multiple 
reference experiments to call the unknown base. The 
statistical method has also been used to implement confidence 
estimates and calling of mixed sequences. 

V. Pooling Processing 

The present invention provides pooling processing,
which is a method of processing reference and sample nucleic
acid sequences together to reduce variations across individual
experiments. In the representative embodiment discussed
herein, the reference and sample nucleic acid sequences are
labeled with fluorescent markers emitting light at different
wavelengths. However, the nucleic acids may be labeled with
other types of markers including distinguishable radioactive
markers.

After the reference and sample nucleic acid 
sequences are labeled with different color fluorescent 
markers, the labeled reference and sample nucleic acid 
sequences are then combined and processed together. An
apparatus for detecting targets labeled with different markers
is provided in U.S. Application No. 08/195,889, which is hereby
incorporated by reference for all purposes.

Fig. 7 illustrates the pooling processing of a
reference and sample nucleic acid sequence. At step 702 a
reference nucleic acid sequence is marked with a fluorescent
dye, such as a fluorescein. At step 704 a sample nucleic acid
sequence is marked with a dye that, upon excitation, emits
light of a different wavelength than the fluorescent dye
of the reference sequence. For example, the sample nucleic
acid sequence may be marked with rhodamine.

At step 706 the labeled reference sequence and the
labeled sample sequence are combined. After this step,
processing continues in the same manner as for only one
labeled sequence. At step 708 the sequences are fragmented.
The fragmented nucleic acid sequences are then hybridized on a
chip containing probes as shown at step 710.

At step 712 a scanner generates image files that
indicate the locations where the labeled nucleic acids bound
to the chip. In general, the scanner generates an image file
by focusing excitation light on the hybridized chip and
detecting the fluorescent light that is emitted. The marker
emitting the fluorescent light can be identified by the
wavelength of the light. For example, the fluorescence peak
of fluorescein is about 530 nm while that of a typical
rhodamine dye is about 580 nm.
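
Identifying a marker by wavelength can be illustrated with the Python sketch below. The nearest-peak rule is an assumption made only for illustration; the text does not describe how the scanner separates the two emissions, and an actual instrument would typically do so optically. The peak values are the approximate figures quoted above.

```python
# Approximate fluorescence peaks quoted in the text, in nanometers.
MARKER_PEAKS_NM = {"fluorescein": 530.0, "rhodamine": 580.0}


def identify_marker(wavelength_nm, peaks=MARKER_PEAKS_NM):
    """Return the marker whose emission peak is closest to the wavelength.

    Assumption: a simple nearest-peak rule, not the scanner's actual method.
    """
    return min(peaks, key=lambda name: abs(peaks[name] - wavelength_nm))


reference_marker = identify_marker(535.0)
sample_marker = identify_marker(575.0)
```

Additional distinguishable markers could be supported by adding entries to the peak table.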

The scanner creates an image file for the data
associated with each fluorescent marker, indicating the
locations where the correspondingly labeled nucleic acid bound
to the chip. Based upon an analysis of the fluorescence
intensities and locations, it becomes possible to extract
information such as the monomer sequence of DNA or RNA.

Pooling processing reduces variations across
individual experiments because much of the test environment is
common. Although pooling processing has been described as
being used to improve the combined processing of reference and
sample nucleic acid sequences, the process may also be used
for two reference sequences, two sample sequences, or multiple
sequences by utilizing multiple distinguishable markers.

VI. Comparative Analysis (ViewSeq™)

The present invention provides a method of
comparative analysis and visualization of multiple
experiments. The method allows the intensity ratio,
reference, and statistical methods to be run on multiple
datafiles simultaneously. This permits different experimental
conditions, sample preparations, and analysis parameters to be
compared in terms of their effects on sequence calling. The
method also provides verification and editing functions, which
are essential to reading sequences, as well as navigation and
analysis tools.

Fig. 8 illustrates the main screen and the
associated pull down menus for comparative analysis and
visualization of multiple experiments. The windows shown are
from an appropriately programmed Sun Workstation. However,
the comparative analysis software may also be implemented on
or ported to a personal computer, including IBM PCs and
compatibles, or other workstation environments. A window 802
is shown having pull down menus for the following functions:
File 804, Edit 806, View 808, Highlight 810, and Help 812.

The main section of the window is divided into a
reference sequence area 814 and a sample sequence area 816.
The reference sequence area is where known sequences are
displayed and is divided into a reference name subarea 818 and
reference base subarea 820. The reference name subarea is
shown with filenames that contain the reference sequences.
The chip wild-type is identified by the filename with the
extension ".wt#" where the # indicates a unit on the chip.
The reference base subarea contains the bases of the reference
sequences. A capital C 822 is displayed to the right of the
reference sequence that is the chip wild-type for the current
analysis. Although the chip wild-type sequence has associated
fluorescence intensities, the other reference sequences shown
below the chip wild-type may be known sequences that have not
been tiled on the chip. These may or may not have associated
fluorescence intensities. The reference sequences other than
the chip wild-type are used for sequence comparisons and may
be in the form of simple ASCII text files.

Sample sequence area 816 is where sample or unknown
experimental sequences are displayed for comparison with the
reference sequences. The sample sequence area is divided into
a sample name subarea 824 and sample base subarea 826. The
sample name subarea is shown with filenames that contain the
sample sequences. The filename extensions indicate the method
used to call the sample sequence where ".cq#" denotes the
intensity ratio method, ".rq#" denotes the reference method,
and ".sq#" denotes the statistical method (# indicates the
unit on the chip). The sample base subarea contains the bases
of the sample sequences. The bases of the sample sequences
are identified by the codes previously set forth which, for
the most part, conform to the IUPAC standard.
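
The filename convention for sample sequences can be decoded as in the following Python sketch. It assumes that "#" stands for a decimal unit number; the function name and the example filenames are hypothetical.

```python
import re

# Calling method indicated by each sample filename extension prefix.
METHOD_BY_PREFIX = {
    "cq": "intensity ratio",
    "rq": "reference",
    "sq": "statistical",
}


def parse_sample_filename(filename):
    """Return (calling method, chip unit) or None if the extension is unknown.

    Assumption: the "#" in the extension is a decimal unit number.
    """
    match = re.search(r"\.(cq|rq|sq)(\d+)$", filename)
    if match is None:
        return None
    return METHOD_BY_PREFIX[match.group(1)], int(match.group(2))


method, unit = parse_sample_filename("patient_a.sq2")
```

A chip wild-type file such as "protease.wt1" is not a sample file under this convention, so the function returns None for it.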

Window 802 also contains a message panel 828. When 
the user selects a base with an input device in the reference 
or sample base subarea, the base becomes highlighted and the 
pathname of the file containing the base is displayed in the 
message panel. The base's position in the nucleic acid 
sequence is also displayed in the message panel. 

In pull down menu File 804, the user is able to load 
files of experimental sequences that have been tiled and 
scanned on a chip. There is a chip wild-type associated with 
each experimental sequence. The chip wild-type associated 
with the first experimental sequence loaded is read and shown 
as the chip wild-type in reference sequence area 814. The 
user is also able to load files of known nucleic acid 
sequences as reference sequences for comparison purposes. As 
before, these known reference sequences may or may not have 
associated probe intensity data. Additionally, in this menu 
the user is able to save sequences that are selected on the 
screen into a project file that can be loaded in at a later 
time. The project file also contains any linkage of the
sequences, where sequences are linked for comparison
purposes. Individual sequences, both reference and sample, are
selected by selecting the sequence filename with an input
device in the reference or sample name subareas.

In pull down menu Edit 806, the user is able to link 
together sequences in the reference and sample sequence areas. 
After the user has selected one reference and one or more 
sample sequences, the sample sequences can be linked to the 
reference sequence by selecting an entry in the pull down 
menu. Once the sequences are linked, a link number 830 is
displayed next to each of the linked sequences. Each group of
linked sequences is associated with a unique link number, so
the user can easily identify which sequences are linked
together. Linking sequences permits the user to more easily
compare the linked sequences. The user is also able to remove
and display links in this menu.

In pull down menu View 808, the user is able to
display intensity graphs for selected bases. Once a base is
selected in the reference or sample base subareas, the user
may request an intensity graph showing the hybridized probe
intensities of the selected base and a delineated neighborhood
of bases near the selected base. Intensity graphs may be
displayed for one or multiple selected bases. The user is
also able to prepare comment files and reports in this menu.

Fig. 9 illustrates an intensity graph window for a
selected base at position 120. The filename containing the
sequence data is displayed at 904. The graph shows the
intensities for each of the hybridized probes associated with
a base. Each grouping of four vertical bars on the graph,
which are labeled as "a", "c", "g", and "t" on line 906, shows
the background subtracted intensities of probes having the
indicated substitution base. In one embodiment, the called
bases are shown in red. The wild-type base is shown at line
908, the called base is shown at line 910, and the base
position is shown at line 912. In Fig. 9, the base selected
is at position 120, as shown by arrow 914. The wild-type base
at this position is T; however, the called base is M which
means the base is either A or C (amino). The user is able to
use intensity graphs to visually compare the intensities of
each of the possible calls.
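
The meaning of a called code such as M can be looked up programmatically. The Python sketch below uses the standard IUPAC ambiguity codes; treating the codes N and X as "no call" (an empty set) follows this document's usage of N for insufficient intensity and X for insufficient discrimination, and is an assumption of the sketch rather than part of the IUPAC standard.

```python
# Standard IUPAC nucleotide codes for one to three possible bases.
IUPAC_BASES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "M": "AC", "K": "GT", "S": "CG", "W": "AT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
}


def possible_bases(code):
    """Return the set of bases a called code may represent.

    Assumption: N and X, as used in this document, yield an empty set.
    """
    return set(IUPAC_BASES.get(code.upper(), ""))


# The situation of Fig. 9: wild-type T, called base M.
called = possible_bases("M")
differs_from_wild_type = "T" not in called
```

Here the called set is A or C, which excludes the wild-type T, so the position would be flagged as differing from the wild-type.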

Fig. 10 illustrates multiple intensity graph windows
for selected bases. There are three intensity graph windows
1002, 1004, and 1006 as shown. Each window may be associated
with a different experiment, where the sequence analyzed in
the experiment may be either a reference (if it has associated
probe intensity data as in the chip wild-type) or a sample
sequence. The windows are aligned and a rectangular box 1008
shows the selected bases' position in each of the sequences
(position 162 in Fig. 10). The rectangular box aids the user
in identifying the selected bases.

Referring again to Fig. 8, in pull down menu
Highlight 810, the user is able to compare the sequences of
references and samples. At least four comparisons are
available to the user, including the following: sample
sequences to the chip wild-type sequence, sample sequences to
any reference sequences, sample sequences to any linked
reference sequences, and reference sequences to the chip
wild-type sequence. For example, after the user has linked a
reference and sample sequence, the user can compare the bases
in the linked sequences. Bases in the sample sequence that
are different from the reference sequence will then be
indicated on the display device to the user (e.g., base is
shown in a different color). In another example, the user is
able to perform a comparison that will help identify sample
sequences. After a sample is linked to multiple reference
sequences, each base in the sample sequence that does not
match the wild-type sequence is checked to see if it matches
one of the linked reference sequences. The bases that match a
linked reference sequence will then be indicated on the
display device to the user. The user may then more easily
identify the sample sequence as being one of the reference
sequences.
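
The two Highlight comparisons described above can be sketched in Python as follows. The sketch assumes that all sequences are already aligned, that positions are counted from zero, and that bases are compared as exact characters (ambiguity codes are not expanded). The function names and the short sequences are hypothetical.

```python
def differing_positions(sample, reference):
    """Return the 0-based positions where the sample differs from the reference."""
    return [i for i, (s, r) in enumerate(zip(sample, reference)) if s != r]


def explained_by_references(sample, wild_type, linked_references):
    """Map each non-wild-type position to the linked references it matches."""
    hits = {}
    for i in differing_positions(sample, wild_type):
        names = [
            name for name, seq in linked_references.items()
            if i < len(seq) and seq[i] == sample[i]
        ]
        if names:
            hits[i] = names
    return hits


# Hypothetical aligned sequences.
wild_type = "ACGTAC"
sample = "ACATAC"
linked = {"mutant_1": "ACATAC", "mutant_2": "ACGTTC"}

mismatches = differing_positions(sample, wild_type)
matches = explained_by_references(sample, wild_type, linked)
```

In this example the sample differs from the wild-type only at position 2, and that base matches the linked reference "mutant_1", which suggests the sample is that reference sequence.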

In pull down menu Help 812, the user is able to get
information and instructions regarding the comparative
analysis program, the calling methods, and the IUPAC
definitions used in the program.

Fig. 11 illustrates the intensity ratio method
correctly calling a mutation in solutions with varying
concentrations. A window 1102 is shown with a chip wild-type
1104 and a mutant sequence 1106. The mutant sequence differs
from the chip wild-type at the position indicated by the
rectangular box 1108. The chip wild-type and mutant sequences
are a region of HIV Pol Gene spanning mutations occurring in
AZT drug therapy.

There are seven sample sequences that are called
using the intensity ratio method. The sample sequences are
actually solutions of different proportions of the chip
wild-type sequence and the mutant sequence. Thus, there are
sample solutions 1110, 1112, 1114, 1116, 1118, 1120, and 1122.
The solutions are 15-mer tilings across the chip wild-type with
increased percentages of the mutant sequence from 0 to 100% by
weight. The following shows the proportions of the sample
solutions:

Sample Solution    Chip Wild-Type:Mutant
1110               100:0
1112               90:10
1114               75:25
1116               50:50
1118               25:75
1120               10:90
1122               0:100

For example, sample solution 1114 contains 75% chip wild-type 
sequence and 25% mutant sequence. 

Now referring to the bases called in rectangular box
1108 for the sample solutions, the intensity ratio method
correctly calls sample solution 1110 as having a base A as in
the chip wild-type sequence. This is correct because sample
solution 1110 is 100% chip wild-type sequence. The intensity
ratio method also calls sample solution 1112 as having a base
A because the sample solution is 90% chip wild-type sequence.

The intensity ratio method calls the identified base
in sample solutions 1114 and 1116 as being an R, which is an
IUPAC ambiguity code denoting A or G (purine). This is also a
correct base call because the sample solutions have from 75%
to 50% chip wild-type sequence and from 25% to 50% mutation
sequence. Thus, the intensity ratio method correctly calls
the base in this transition state.

Sample solutions 1118, 1120, and 1122 are called by 
the intensity ratio method as having a mutation base G at the 
specified location. This is a correct base call because the 
sample solutions primarily consist of the mutation sequence 
(75%, 90%, and 100% respectively). Again, the intensity ratio 
method correctly called the bases. 

These experiments also show that the base calling 
methods of the present invention may also be used for 
solutions of more than one nucleic acid sequence. 

Fig. 12 illustrates the reference method correctly
calling a base. There are three intensity graph windows
1202, 1204, and 1206 as shown. The windows are
aligned and a rectangular box 1208 outlines the bases of
interest. Window 1202 shows a sample sequence called using
the intensity ratio method. However, the base in the
rectangular box 1208 was incorrectly called base C; there
is actually a base A at that position. The intensity
ratio method incorrectly called the base as C because the
probe intensity associated with base C is much higher than the
other probe intensities.
Window 1204 shows a reference sequence called using
the intensity ratio method. As the reference sequence is
known, it is not necessary to know the method used to call the
reference sequence. However, it is important to have probe
intensities for a reference sequence to use the reference
method. The reference sequence has a base C at the position
indicated by the rectangular box.

Window 1206 shows the sample sequence called using
the reference method. The reference method correctly calls
the specified base as being base A. Thus, for some cases the
reference method is preferable to the intensity ratio method
because it compares probe intensities of a sample sequence to
probe intensities of a reference sequence.

VII. Examples

Example 1

The intensity ratio method was used in sequence
analysis of various polymorphic HIV-1 clones using a protease
chip. Single stranded DNA of a 382 nt region was used with 4
different clones (HXB2, SF2, NY5, pPol4mutl8). Results were
compared to results from an ABI sequencer. The results are
illustrated below:

                     ABI              Protease Chip
               Sense  Antisense     Sense  Antisense
No call          0        4           9        4
Ambiguous        6       14          17        8
Wrong call       2        3           3        1
TOTAL            8       21          29       13

SUMMARY

ABI (sense) - 99.5%
Chip (sense) - 98.1%

ABI (antisense) - 98.6%
Chip (antisense) - 99.1%
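
The summary percentages are consistent with the totals in the table if each total is taken against 4 clones of 382 nt each, i.e. 1528 bases per column. That denominator is an assumption (the text does not state it), but the Python check below shows that it reproduces all four summary figures.

```python
REGION_LENGTH_NT = 382
CLONE_COUNT = 4

# Assumption: each column covers every base of all four clones.
total_bases = REGION_LENGTH_NT * CLONE_COUNT  # 1528

# TOTAL row of the table: no calls + ambiguous calls + wrong calls.
problem_calls = {
    "ABI (sense)": 0 + 6 + 2,
    "ABI (antisense)": 4 + 14 + 3,
    "Chip (sense)": 9 + 17 + 3,
    "Chip (antisense)": 4 + 8 + 1,
}

# Percentage of bases called unambiguously and correctly, to one decimal.
accuracy = {
    name: round(100.0 * (1 - count / total_bases), 1)
    for name, count in problem_calls.items()
}
```

Under this assumption the computed values are 99.5, 98.6, 98.1, and 99.1 percent, matching the summary above.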



Example 2 

HIV protease genotyping was performed using the
described chips and CallSeq™ intensity ratio calculations.
Samples were evaluated from AIDS patients before and after ddI
treatment. Results were confirmed with ABI sequencing.

Fig. 13 illustrates the output of the ViewSeq™
program with four pretreatment samples and four posttreatment
samples. Note position 207, where a mutation has arisen. Even
adjacent to two additional mutations (gt), the "a" mutation
has been properly detected.

VIII. Appendices 

The Microfiche appendices (copyright Affymetrix,
Inc.) provide C++ source code and header files for
implementing the present invention. Appendix A contains the
source code files (.cc files) for CallSeq™, which is a base
calling program that implements the intensity ratio,
reference, and statistical methods of the present invention.
Appendix B contains the header files (.h files) for CallSeq™.
Appendices C and D contain the source code and header files,
respectively, for a program that performs a preprocessing
stage for the statistical method of CallSeq™.

Appendix E contains the source code and header files
for ViewSeq™, which is a comparative analysis and
visualization program according to the present invention.
Appendices A-E are written for a Sun Workstation.

The above description is illustrative and not
restrictive. Many variations of the invention will become
apparent to those of skill in the art upon review of this
disclosure. Merely by way of example, while the invention is
illustrated with particular reference to the evaluation of DNA
(natural or unnatural), the methods can be used in the
analysis from chips with other materials synthesized thereon,
such as RNA. The scope of the invention should, therefore, be
determined not with reference to the above description, but
instead should be determined with reference to the appended
claims along with their full scope of equivalents.